[chore] RFC: Count PartialSuccess in exporterhelper#15280

Open
braydonk wants to merge 1 commit into open-telemetry:main from braydonk:rfc_partial_success

Contributor

@braydonk braydonk commented May 8, 2026

Description

This RFC introduces proper accounting for PartialSuccess in a way that aligns with the definition of partial success in the OTLP spec.

Prior art: my first attempt at this was #14152, but I went back to the drawing board after realizing it was insufficient compared to the OTLP spec and after an unrelated refactor to the obs_report_sender code.

Link to tracking issue

#13423
#14440

AI Usage Disclosure

Minor Antigravity usage in the linked branches in the Implementation Plan section. RFC itself was handwritten.

@braydonk braydonk requested review from a team, bogdandrutu, codeboten, dmitryax and mx-psi as code owners May 8, 2026 21:22
@github-actions github-actions Bot added the rfc:approvals-needed This RFC needs approvals from collector-approvers label May 8, 2026
@braydonk braydonk changed the title RFC: Count PartialSuccess in exporterhelper [chore] RFC: Count PartialSuccess in exporterhelper May 8, 2026

codspeed-hq Bot commented May 8, 2026

Merging this PR will improve performance by 48.81%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 2 improved benchmarks
✅ 5 untouched benchmarks
⏩ 76 skipped benchmarks [1]

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| BenchmarkMemoryQueueWaitForResult | 74.9 µs | 50.3 µs | +48.81% |
| BenchmarkPersistentQueue | 205.2 µs | 148 µs | +38.67% |

Comparing braydonk:rfc_partial_success (613d4b5) with main (91b32ef)

Footnotes

  1. 76 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.


codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.23%. Comparing base (91b32ef) to head (613d4b5).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #15280   +/-   ##
=======================================
  Coverage   91.23%   91.23%           
=======================================
  Files         703      703           
  Lines       45902    45902           
=======================================
  Hits        41878    41878           
  Misses       2820     2820           
  Partials     1204     1204           



Two new error APIs will be introduced:

### `Countable` error
Member

I'm not sure I understand why we need this one. What use cases do we want to cover where this should be used instead of PartialSuccess?

Contributor Author

I suppose it's possible to reduce this to just PartialSuccessError, but since this is also going to be used by the prometheus receiver (see #14440), I thought the PartialSuccess error, which is so tied to exporter use cases, wouldn't make sense there. I wanted to make it flexible so that such components could report any error they want and simply wrap it in a count.

Contributor

I don't see a reason why the Prometheus receiver couldn't use a partialsuccess error, as long as we can provide the failed telemetry count somehow.


#### Why does it implement status code `OK`?

A `PartialSuccess` is nominally a success according to the spec. As a result, when `exporterhelper` processes this, it needs to know that the Go error it received doesn't actually correspond to a real failure to send data. This will allow accurate setting of the span status (it shouldn't be `Failed` upon `PartialSuccess`) and determination of `error.type` (a new custom type called `Partial_Success` would be introduced).
Member

I think we need to clearly define whether we want this error to be propagated by the exporter. If we agree that the exporter clearly reports the failed portion as dropped in the internal telemetry and returns nil, we don't need to include gRPC status code resolution. However, in order to decide on this, we need to clearly outline how the internal telemetry will be affected in the exporter and in upstream components (if we propagate).

Contributor Author

@braydonk braydonk May 11, 2026

I think the error needs to be propagated by the exporter for sure; I can't think of any nice way around that. This means that the gRPC status code resolution is then necessary for obs_report_sender.go to count the error with an appropriate error.type and to recognize that it should not mark the span as failed.

The rest of the error API was designed so that any part of exporterhelper other than obs_report_sender.go can work with the error like any other permanent error when it receives it.


Two new error APIs will be introduced:

### `Countable` error
Member

Do we need to have this public surface?

Member

> This is an error that any other error can be wrapped in, along with a count of failures. This will allow for cases where producers of an error want to include a number of failures.

What is the unit for "failures"? It may not be bad to have this as "PartialError" and map it 1:1 to PartialSuccess.

$1 question: in the case of a connector (like traces to metrics), suppose we fail and get back that we failed to export 2 metrics. For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed". How do we treat this case?

Contributor Author

@braydonk braydonk May 11, 2026

> Do we need to have this public surface?

Do you mean does the interface need to be public? Not necessarily. I could just make a public function `func GetErrorCount(err error) (errCount int, isCountable bool)`. I thought that was more awkward than the current API of getting the error as `xconsumererror.Countable` and then calling the `Count` method on it, but the former makes for less public surface.

Contributor Author

> what is the unit for "failures"?

In the error itself I intentionally exclude unit. The unit would be decided by whoever is consuming the error to record failures. In my unfinished otlp exporter draft this ends up working out based on the fact that the exporter itself has a signal type, so when obsReportSender is recording failures for a logging otlp exporter, that failed_sent metric already has a unit of {log}.

Contributor Author

@braydonk braydonk May 11, 2026

> In case of a connector (like traces to metrics)

I hadn't considered connectors, since I was only thinking of exporters for this RFC. However I think most of the design will still fit in, but the big open question of how to record this for more complex scenarios like fanoutconsumer was blocking the ability to at least solve this for exporterhelper and receiverhelper, where the case is more clear-cut.

(I'm not sure I understand how error metrics are supposed to be recorded at all for fanoutconsumer scenarios, let alone being able to report partial counts, but I think if the error recording behaviour is clear for fanoutconsumer the partial part is relatively simple to slot in).

> For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed".

Are you talking about the scenario where something failed downstream from the metrics post-conversion? Assuming the simple scenario of 1-to-1 traces to metrics and no fanout, I see two scenarios:

  1. The trace/span failed to be converted to a metric or was refused by the conversion process in some way. This seems pretty clear that "1" failed item is "1 failed span in conversion" and would be 1 refused span that obsconsumer could potentially record by recognizing xconsumererror.Countable here.
  2. The metric failed post-conversion by something down its pipeline. I think it's essentially impossible to determine whether a failed metric in that pipeline somehow lines up with a particular span that was converted to a metric. I think it's fair to say that the tracestometrics connector did its job, and considers all spans successful, and if the metric fails after successful conversion then that's for components further down the line to properly report.

I wasn't part of any connector design discussions that happened or may be happening, so the above is based on my own conclusions by reading and not by any definitive explanation given to me.

Comment thread docs/rfcs/partial-success-counting.md

A `PartialSuccess` error will wrap an internal error as permanent, implement [gRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.

#### Why should the internal error be permanent?
Member

@open-telemetry/technical-committee is this a correct read of the protocol definition? I personally think it is.

Member

Yes.

However, I think the protocol specification can be improved/fixed to allow the service to provide more information regarding the partial success, which would allow the client to retry the items that failed to be delivered if these are retriable.

Member

Yes, I think this is the correct read. The OTLP spec says:

> The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.

The partial success response signals that some items are successfully received and some others are bad and cannot be received, so no retries are needed for either.

@hilmarf hilmarf May 12, 2026

> and some others are bad and cannot be received, so no retries are needed for either.

@tigrannajaryan What could be a reason for being bad? Why couldn't they be received? (We have scenarios where it's not acceptable to have unknown failures.)

Member

> What could be a reason for being bad?

Any violation of requirements defined in https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto could be the reason the data cannot be received.

Member

@bogdandrutu bogdandrutu left a comment

Overall looks good, needs more polish on some edge cases.

* Fully counting data drops throughout the pipeline
- The new `Countable` interface does open the possibility of counting partial failures at various points throughout the pipeline, which might be worth exploring down the line. The only thing I want to address right now is ensuring these are properly counted at the exporter metric level.
* Partial retries
- This is purely covering scenarios that align with the OTLP spec's definition of `PartialSuccess`, which does not come with any functionality to determine exactly which items failed.

Given that consumererror can currently be used to trigger partial retries in the retry_sender, won't it be necessary to continue to support both:

  1. partial successes that can be partially retried and counted partially once retries fully fail
  2. partial successes that can only be counted

Otherwise, does it mean we will have behaviour where consumererror.NewTraces(err, failedTraces) indicates partial failure but it isn't counted partially?


### `PartialSuccess` error

A `PartialSuccess` error will wrap an internal error as permanent, implement [gRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.
Contributor

Why does it need to implement gRPC status code resolution? It looks like FromError looks for wrapped errors as well (using errors.As: https://github.com/grpc/grpc-go/blob/cb18228317ff523e63d931b4058b0329585b7dcd/status/status.go#L113). So if, for example, the OTLP exporter wants to provide the gRPC status code OK in its PartialSuccess error, it should just be able to use standard Go error wrapping to do so.


This API may end up being useful in other places than `PartialSuccess`, such as [in `receiverhelper`](https://github.com/open-telemetry/opentelemetry-collector/issues/14440).

### `PartialSuccess` error
Contributor

Is this error meant to specifically be the OTLP "PartialSuccess" error? Or is this meant to be a collector type that means "The operation succeeded overall, but some parts failed"? If it is OTLP-specific, it shouldn't be in exporterhelper, and shouldn't be used by other components. If it is a generic "this thing partially succeeded" error, then it shouldn't do things like set gRPC status codes, etc.

Labels

rfc:approvals-needed This RFC needs approvals from collector-approvers

8 participants